ABSTRACT


In this work we present topic diversification, a novel method 
designed to balance and diversify personalized recommenda- 
tion lists in order to reflect the user’s complete spectrum of 
interests. Though being detrimental to average accuracy, we 
show that our method improves user satisfaction with rec- 
ommendation lists, in particular for lists generated using the 
common item-based collaborative filtering algorithm. 

Our work builds upon prior research on recommender sys- 
tems, looking at properties of recommendation lists as en- 
tities in their own right rather than specifically focusing on 
the accuracy of individual recommendations. We introduce 
the intra-list similarity metric to assess the topical diver- 
sity of recommendation lists and the topic diversification 
approach for decreasing the intra-list similarity. We evalu- 
ate our method using book recommendation data, including 
offline analysis on 361,349 ratings and an online study
involving more than 2,100 subjects.


Categories and Subject Descriptors 


H.3.3 [Information Storage and Retrieval]: Information
Retrieval and Search—Information Filtering; I.2.6 [Artificial
Intelligence]: Learning—Knowledge Acquisition


General Terms 


Algorithms, Experimentation, Human Factors, Measurement 


Keywords 


Collaborative filtering, diversification, accuracy, recommend- 
er systems, metrics 


1. INTRODUCTION


Recommender systems [23] intend to provide people with 
recommendations of products they will appreciate, based on 
their past preferences, history of purchase, and demographic 
information. Many of the most successful systems make use 
of collaborative filtering [27, 8, 11], and numerous commer- 
cial systems, e.g., Amazon.com’s recommender [16], exploit 
these techniques to offer personalized recommendation lists
to their customers. 

Though the accuracy of state-of-the-art collaborative filtering
systems, i.e., the probability that the active user¹ will
appreciate the products recommended, is excellent, some im- 
plications affecting user satisfaction have been observed in 
practice. Thus, on Amazon.com (http://www.amazon.com),
many recommendations seem to be “similar” with respect to 
content. For instance, customers that have purchased many 
of Hermann Hesse’s prose may happen to obtain recom- 
mendation lists where all top-5 entries contain books by 
that respective author only. When considering pure accu- 
racy, all these recommendations appear excellent since the 
active user clearly appreciates books written by Hermann 
Hesse. On the other hand, assuming that the active user 
has several interests other than Hermann Hesse, e.g., his- 
torical novels in general and books about world travel, the 
recommended set of items appears poor, owing to its lack of 
diversity. 

Traditionally, recommender system projects have focused 
on optimizing accuracy using metrics such as precision/recall 
or mean absolute error. Now research has reached the point 
where going beyond pure accuracy and toward real user ex- 
perience becomes indispensable for further advances [10]. 

This work looks specifically at impacts of recommendation 
lists, regarding them as entities in their own right rather 
than mere aggregations of single and independent sugges- 
tions. 


1.1 Contributions 


We address the afore-mentioned deficiencies by focusing 
on techniques that are centered on real user satisfaction 
rather than pure accuracy. The contributions we make in 
this paper are the following: 


• Topic diversification. We propose an approach towards
balancing top-N recommendation lists according to the
active user’s full range of interests. Our novel method
takes into consideration both the accuracy of suggestions
made and the user’s extent of interest in specific topics.
Analyses of topic diversification’s implications on
user-based [11, 22] and item-based [26, 5] collaborative
filtering are provided.


¹The term “active user” refers to the person for whom
recommendations are made.


• Intra-list similarity metric. Since we regard diversity as
an important ingredient of user satisfaction, metrics able
to measure that characteristic are required. We propose the
intra-list similarity metric as an efficient means for
measurement, complementing existing accuracy metrics in
their efforts to capture user satisfaction.


• Accuracy versus satisfaction. There have been several
efforts in the past arguing that “accuracy does not tell
the whole story” [4, 12]. Nevertheless, no evidence has
been given to show that some aspects of actual user
satisfaction reach beyond accuracy. We close this gap and
provide analysis from large-scale online and offline
evaluations, matching results obtained from accuracy
metrics against actual user satisfaction and investigating
interactions and deviations between both concepts.


1.2 Organization 


Our paper is organized as follows. We discuss collabora- 
tive filtering and its two most prominent implementations 
in Section 2. The subsequent section then briefly reports on 
common evaluation metrics and the new intra-list similarity 
metric. In Section 4, we present our method for diversify- 
ing lists, describing its primary motivation and algorithmic 
clockwork. Section 5 reports on our offline and online exper- 
iments with topic diversification and provides ample discus- 
sion of results obtained. 


2. ON COLLABORATIVE FILTERING 


Collaborative filtering (CF) still represents the most com- 
monly adopted technique in crafting academic and commer- 
cial [16] recommender systems. Its basic idea refers to mak- 
ing recommendations based upon ratings that users have 
assigned to products. Ratings can either be explicit, i.e., by 
having the user state his opinion about a given product, or 
implicit, when the mere act of purchasing or mentioning of 
an item counts as an expression of appreciation. While
implicit ratings are generally easier to collect, their usage
implies adding noise to the collected information [20].


2.1 User-based Collaborative Filtering 


User-based CF has been explored in-depth during the last 
ten years [29, 24, 14] and represents the most popular recom- 
mendation algorithm [11], owing to its compelling simplicity 
and excellent quality of recommendations. 

CF operates on a set of users $A = \{a_1, a_2, \ldots, a_n\}$, a set of
products $B = \{b_1, b_2, \ldots, b_m\}$, and partial rating functions
$r_i : B \to [-1, +1]^{\perp}$ for each user $a_i$. Negative values $r_i(b_k)$
denote utter dislike, while positive values express $a_i$’s liking
of product $b_k$. If ratings are implicit only, we represent them
by set $R_i \subseteq B$, equivalent to $\{b_k \in B \mid r_i(b_k) \neq \perp\}$.

The user-based CF’s working process can be broken down 
into two major steps: 


• Neighborhood formation. Assuming $a_i$ as the active
user, similarity values $c(a_i, a_j) \in [-1, +1]$ for all
$a_j \in A \setminus \{a_i\}$ are computed, based upon the similarity
of their respective rating functions $r_i, r_j$. In general,
Pearson correlation [29, 8] or cosine distance [11] are
used for computing $c(a_i, a_j)$. The top-M most similar
users $a_j$ become members of $a_i$’s neighborhood,
$\mathit{clique}(a_i) \subseteq A$.


• Rating prediction. Taking all the products $b_k$ that
$a_i$’s neighbors $a_j \in \mathit{clique}(a_i)$ have rated and which are
new to $a_i$, i.e., $r_i(b_k) = \perp$, a prediction of liking $w_i(b_k)$
is produced. Value $w_i(b_k)$ hereby depends on both the
similarity $c(a_i, a_j)$ of voters $a_j$ with $r_j(b_k) \neq \perp$, as well
as the ratings $r_j(b_k)$ these neighbors $a_j$ assigned to $b_k$.


Eventually, a list $P_{w_i} : \{1, 2, \ldots, N\} \to B$ of top-N
recommendations is computed, based upon predictions $w_i$. Note
that function $P_{w_i}$ is injective and reflects recommendation
ranking in descending order, giving highest predictions first.
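
The two steps can be illustrated with a minimal Python sketch. The text above does not fix the exact prediction formula for user-based CF, so the similarity-weighted average used in `top_n` is an assumption made for illustration, as are all function names and the toy data layout (a dictionary mapping users to their rated products).

```python
from math import sqrt
from typing import Dict, List

Ratings = Dict[str, Dict[str, float]]  # user -> {product: rating in [-1, +1]}


def pearson(ru: Dict[str, float], rv: Dict[str, float]) -> float:
    """Pearson correlation over the products both users have rated."""
    common = sorted(set(ru) & set(rv))
    if len(common) < 2:
        return 0.0
    mu = sum(ru[b] for b in common) / len(common)
    mv = sum(rv[b] for b in common) / len(common)
    num = sum((ru[b] - mu) * (rv[b] - mv) for b in common)
    den = sqrt(sum((ru[b] - mu) ** 2 for b in common)) * sqrt(
        sum((rv[b] - mv) ** 2 for b in common)
    )
    return num / den if den else 0.0


def clique(ratings: Ratings, active: str, m: int) -> List[str]:
    """Neighborhood formation: the top-M users most similar to `active`."""
    others = [u for u in ratings if u != active]
    others.sort(key=lambda u: pearson(ratings[active], ratings[u]), reverse=True)
    return others[:m]


def top_n(ratings: Ratings, active: str, m: int, n: int) -> List[str]:
    """Rating prediction for unseen products, then a descending top-N list.

    Assumption: predictions are similarity-weighted averages of the
    neighbors' ratings; the paper does not spell out this formula.
    """
    seen = ratings[active]
    num: Dict[str, float] = {}
    den: Dict[str, float] = {}
    for u in clique(ratings, active, m):
        c = pearson(seen, ratings[u])
        for b, r in ratings[u].items():
            if b not in seen:  # only products that are new to the active user
                num[b] = num.get(b, 0.0) + c * r
                den[b] = den.get(b, 0.0) + abs(c)
    w = {b: num[b] / den[b] for b in num if den[b] > 0}
    return sorted(w, key=w.get, reverse=True)[:n]
```

Because the returned list contains each product at most once and is sorted by predicted liking, it plays the role of the injective map $P_{w_i}$.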


2.2 Item-based Collaborative Filtering 


Item-based CF [13, 26, 5] has been gaining momentum 
over the last five years by virtue of favorable computational 
complexity characteristics and the ability to decouple the 
model computation process from actual prediction making. 
Specifically for cases where $|A| \gg |B|$, item-based CF’s
computational performance has been shown superior to
user-based CF [26]. Its success also extends to many commercial
recommender systems, such as Amazon.com’s [16]. 

As with user-based CF, recommendation making is based
upon ratings $r_i(b_k)$ that users $a_i \in A$ provided for products
$b_k \in B$. However, unlike user-based CF, similarity values $c$
are computed for items rather than users, hence
$c : B \times B \to [-1, +1]$. Roughly speaking, two items $b_k, b_e$ are
similar, i.e., have large $c(b_k, b_e)$, if users who rate one of
them tend to rate the other, and if users tend to assign them
identical or similar ratings. Moreover, for each $b_k$, its
neighborhood $\mathit{clique}(b_k) \subseteq B$ of top-M most similar items is
defined.

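Predictions $w_i(b_k)$ are computed as follows:


$$w_i(b_k) = \frac{\sum_{b_e \in B'_k} \left( c(b_k, b_e) \cdot r_i(b_e) \right)}{\sum_{b_e \in B'_k} |c(b_k, b_e)|} \tag{1}$$


where

$$B'_k := \{ b_e \mid b_e \in \mathit{clique}(b_k) \wedge r_i(b_e) \neq \perp \}$$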


Intuitively, the approach tries to mimic real user behavior,
having user $a_i$ judge the value of an unknown product
$b_k$ by comparing the latter to known, similar items $b_e$ and
considering how much $a_i$ appreciated these $b_e$.

The eventual computation of a top-N recommendation
list $P_{w_i}$ follows the user-based CF’s process, arranging
recommendations according to $w_i$ in descending order.
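
As a minimal sketch of the item-based prediction step, the following Python function computes a similarity-weighted average of the active user's ratings over the rated neighbors of the target item. The function name and the dictionary-based inputs are hypothetical; the item similarities are assumed to have been computed beforehand.

```python
from typing import Dict, Optional


def predict(user_ratings: Dict[str, float],
            neighbors: Dict[str, float]) -> Optional[float]:
    """Predict the user's liking of a target item b_k.

    user_ratings: the active user's known ratings r_i(b_e).
    neighbors:    similarity c(b_k, b_e) for every b_e in clique(b_k).
    Returns None when the user has rated none of the neighbors.
    """
    rated = [b for b in neighbors if b in user_ratings]  # rated neighbors of b_k
    den = sum(abs(neighbors[b]) for b in rated)
    if den == 0:
        return None
    num = sum(neighbors[b] * user_ratings[b] for b in rated)
    return num / den
```

Items in the neighborhood that the user has not rated (here, any key of `neighbors` missing from `user_ratings`) simply do not contribute to the prediction.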


3. EVALUATION METRICS 


Evaluation metrics are essential in order to judge the quality
and performance of recommender systems, even though
they are still in their infancy. Most evaluations concentrate
on accuracy measurements only and neglect other factors, 
e.g., novelty and serendipity of recommendations, and the 
diversity of the recommended list’s items. 

The following sections give an outline of popular metrics. 
An extensive survey of accuracy metrics is provided in [12]. 


3.1 Accuracy Metrics 


Accuracy metrics have been defined first and foremost for 
two major tasks: 

First, to judge the accuracy of single predictions, i.e., how
much predictions $w_i(b_k)$ for products $b_k$ deviate from $a_i$’s
actual ratings $r_i(b_k)$. These metrics are particularly suited for
tasks where predictions are displayed along with the product,
e.g., annotation in context [12].

Second, decision-support metrics evaluate the effective- 
ness of helping users to select high-quality items from the 
set of all products, generally supposing binary preferences. 


3.1.1 Predictive Accuracy Metrics 


Predictive accuracy metrics measure how close predicted
ratings come to true user ratings. Most prominent and widely
used [29, 11, 3, 9], mean absolute error (MAE) represents an
efficient means to measure the statistical accuracy of
predictions $w_i(b_k)$ for sets $B_i$ of products:


$$|\overline{E}| = \frac{\sum_{b_k \in B_i} |r_i(b_k) - w_i(b_k)|}{|B_i|} \tag{2}$$


Related to MAE, mean squared error (MSE) squares the
error before summing. Hence, large errors become much
more pronounced than small ones.

Although very easy to implement, predictive accuracy metrics
are ill-suited for evaluating the quality of top-N recommendation
lists. Users only care about errors for high-rank products.
On the other hand, prediction errors for low-rank products
are unimportant, knowing that the user has no interest in
them anyway. However, MAE and MSE account for both
types of errors in exactly the same fashion.
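
For illustration, both error measures can be sketched in a few lines of Python. The function names and the dictionary inputs (product mapped to actual or predicted rating) are assumptions made for this example; products without a prediction are skipped.

```python
from typing import Dict, List


def _errors(actual: Dict[str, float], predicted: Dict[str, float]) -> List[float]:
    """Signed errors over all products that have both a rating and a prediction."""
    errs = [actual[b] - predicted[b] for b in actual if b in predicted]
    if not errs:
        raise ValueError("no product has both a rating and a prediction")
    return errs


def mae(actual: Dict[str, float], predicted: Dict[str, float]) -> float:
    """Mean absolute error."""
    errs = _errors(actual, predicted)
    return sum(abs(e) for e in errs) / len(errs)


def mse(actual: Dict[str, float], predicted: Dict[str, float]) -> float:
    """Mean squared error: large deviations weigh more heavily."""
    errs = _errors(actual, predicted)
    return sum(e * e for e in errs) / len(errs)
```

Note that neither function looks at where a product would appear in a ranked list, which is exactly the limitation discussed above.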


3.1.2 Decision-Support Metrics 


Precision and recall, both well-known from information 
retrieval, do not consider predictions and their deviations 
from actual ratings. They rather judge how relevant a set of 
ranked recommendations is for the active user. 

Before using these metrics for cross-validation, K-folding
is applied, dividing every user $a_i$’s rated products $b_k \in R_i$
into K disjoint slices of preferably equal size. Hereby, $K - 1$
randomly chosen slices form $a_i$’s training set $R_i^x$. These
ratings then define $a_i$’s profile from which final
recommendations are computed. For recommendation generation, $a_i$’s
residual slice $(R_i \setminus R_i^x)$ is retained and not used for
prediction. This slice, denoted $T_i^x$, constitutes the test set, i.e.,
those products the recommenders intend to predict.
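
A minimal sketch of this per-user splitting, assuming implicit ratings given as a set of product identifiers; the function name `k_fold` and the fixed random seed are choices made for the example only.

```python
import random
from typing import List, Set, Tuple


def k_fold(rated: Set[str], k: int, seed: int = 0) -> List[Tuple[Set[str], Set[str]]]:
    """Split one user's rated products into K (training set, test set) pairs."""
    items = sorted(rated)                   # fixed order for reproducibility
    random.Random(seed).shuffle(items)      # random assignment to slices
    slices = [items[j::k] for j in range(k)]  # K disjoint, near-equal slices
    folds = []
    for x in range(k):
        test = set(slices[x])               # residual slice, held out
        train = set(items) - test           # the other K - 1 slices
        folds.append((train, test))
    return folds
```

With K = 4, every fold trains on roughly three quarters of the profile and tests on the remaining quarter.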

Sarwar [25] presents an adapted variant of recall, recording
the percentage of test set products $b \in T_i^x$ occurring in
recommendation list $P_i^x$ with respect to the overall number
of test set products $|T_i^x|$:


$$\text{Recall} = 100 \cdot \frac{|T_i^x \cap \Im P_i^x|}{|T_i^x|} \tag{3}$$


Symbol $\Im P_i^x$ denotes the image of map $P_i^x$, i.e., all items
part of the recommendation list.

Accordingly, precision represents the percentage of test
set products $b \in T_i^x$ occurring in $P_i^x$ with respect to the size
of the recommendation list:


$$\text{Precision} = 100 \cdot \frac{|T_i^x \cap \Im P_i^x|}{|\Im P_i^x|} \tag{4}$$
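
Both measures reduce to set intersections, as the following hedged Python sketch shows. The function names are illustrative, and the recommendation list is assumed to contain each product at most once.

```python
from typing import Sequence, Set


def recall(test_set: Set[str], rec_list: Sequence[str]) -> float:
    """Percentage of test set products that occur in the recommendation list."""
    return 100.0 * len(test_set & set(rec_list)) / len(test_set)


def precision(test_set: Set[str], rec_list: Sequence[str]) -> float:
    """Percentage of recommended products that occur in the test set."""
    return 100.0 * len(test_set & set(rec_list)) / len(set(rec_list))
```

Since only the set of recommended items enters the computation, reordering `rec_list` leaves both values unchanged.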


Breese et al. [3] introduce an interesting extension to re- 
call, known as weighted recall or Breese score. The approach 
takes into account the order of the top-N list, penalizing in- 
correct recommendations less severely the further down the 
list they occur. Penalty decreases with exponential decay. 
Other popular decision-support metrics include ROC [28,
18, 9], the “receiver operating characteristic”. ROC mea- 
sures the extent to which an information filtering system is 
able to successfully distinguish between signal and noise. 
Less frequently used, NDPM [2] compares two different, 
weakly ordered rankings. 


3.2 Beyond Accuracy 


Though accuracy metrics are an important facet of useful- 
ness, there are traits of user satisfaction they are unable to 
capture. However, non-accuracy metrics have largely been 
denied major research interest so far. 


3.2.1 Coverage 


Among all non-accuracy evaluation metrics, coverage has 
been the most frequently used [11, 19, 9]. Coverage measures 
the percentage of elements part of the problem domain for 
which predictions can be made. 


3.2.2 Novelty and Serendipity 


Some recommenders produce highly accurate results that
are still useless in practice, e.g., suggesting bananas to
customers in grocery stores. Though such a suggestion is highly
accurate, almost everybody likes and buys bananas anyway.
Hence, recommending them appears far too obvious and of
little help to the shopper.

Novelty and serendipity metrics thus measure the “non- 
obviousness” of recommendations made, avoiding “cherry- 
picking” [12]. For some simple measure of serendipity, take 
the average popularity of recommended items. Lower scores 
obtained denote higher serendipity. 


3.3 Intra-List Similarity 


We present a new metric that intends to capture the diversity
of a list. Hereby, diversity may refer to all kinds of features,
e.g., genre, author, and other discerning characteristics.
Based upon an arbitrary function $c_o : B \times B \to [-1, +1]$
measuring the similarity $c_o(b_k, b_e)$ between products $b_k, b_e$
according to some custom-defined criterion, we define
intra-list similarity for $a_i$’s list $P_{w_i}$ as follows:


$$\mathrm{ILS}(P_{w_i}) = \frac{\sum_{b_k \in \Im P_{w_i}} \sum_{b_e \in \Im P_{w_i},\, b_k \neq b_e} c_o(b_k, b_e)}{2} \tag{5}$$


Higher scores denote lower diversity. An interesting
mathematical feature of $\mathrm{ILS}(P_{w_i})$ we are referring to in later
sections is permutation-insensitivity, i.e., let $S_N$ be the
symmetric group of all permutations on $N = |P_{w_i}|$ symbols:


$$\forall \sigma_i, \sigma_j \in S_N : \mathrm{ILS}(P_{w_i} \circ \sigma_i) = \mathrm{ILS}(P_{w_i} \circ \sigma_j) \tag{6}$$


Hence, simply rearranging positions of recommendations
in a top-N list $P_{w_i}$ does not affect $P_{w_i}$’s intra-list similarity.
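
The metric and its permutation-insensitivity can be checked with a short Python sketch. The function name `ils` and the toy topic-based similarity used in the usage note are assumptions for illustration; any pairwise similarity function with values in [-1, +1] could be plugged in.

```python
from itertools import permutations
from typing import Callable, Sequence


def ils(rec_list: Sequence[str], sim: Callable[[str, str], float]) -> float:
    """Intra-list similarity: sum over all ordered pairs of distinct items,
    halved so that each unordered pair counts once for a symmetric `sim`."""
    return sum(sim(a, b) for a, b in permutations(rec_list, 2)) / 2
```

For example, with a similarity of 1.0 for two books on the same topic and 0.0 otherwise, a list of three books by one author scores 3.0, while three books from three different topics score 0.0. Reversing or shuffling either list yields the same value.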


4. TOPIC DIVERSIFICATION 


One major issue with accuracy metrics is their inability 
to capture the broader aspects of user satisfaction, hiding 
several blatant flaws in existing systems [17]. For instance, 
suggesting a list of very similar items, e.g., with respect to 
the author, genre, or topic, may be of little use for the user, 
even though this list’s average accuracy might be high. 

The issue has been perceived by other researchers before
and was termed the “portfolio effect” by Ali and van Stam [1].
We believe that item-based CF systems in particular are
susceptible to that effect. Reports from the item-based TV
recommender TiVo [1], as well as personal experiences with
Amazon.com’s recommender, also item-based [16], back our
conjecture. For instance, one of this paper’s authors only gets
recommendations for Heinlein’s books, while another complained
about all his suggested books being Tolkien’s writings.

Reasons for negative ramifications on user satisfaction
implied by portfolio effects are well-understood and have been
studied extensively in economics, termed “law of diminishing
marginal returns” [30]. The law describes effects of saturation
that steadily decrease the incremental utility of products
$p$ when acquired or consumed over and over again. For
example, suppose you are offered your favorite drink. Let
$p_1$ denote the price you are willing to pay for that product.
Assuming you are offered a second glass of that particular
drink, the amount $p_2$ of money you are inclined to spend
will be lower, i.e., $p_1 > p_2$. The same holds for $p_3$, $p_4$, and
so forth.

We propose an approach we call topic diversification to 
deal with the problem at hand and make recommended lists 
more diverse and thus more useful. Our method represents 
an extension to existing recommender algorithms and is ap- 
plied on top of recommendation lists. 


4.1 Taxonomy-based Similarity Metric


Function $c^* : 2^B \times 2^B \to [-1, +1]$, quantifying the
similarity between two product sets, forms an essential part of
topic diversification. We instantiate c* with our metric for 
taxonomy-driven filtering [33], though other content-based 
similarity measures may appear likewise suitable. Our met- 
ric computes the similarity between product sets based upon 
their classification. Each product belongs to one or more 
classes that are hierarchically arranged in classification tax- 
onomies, describing the products in machine-readable ways. 
Classification taxonomies exist for various domains. Ama- 
zon.com crafts very large taxonomies for books, DVDs, CDs, 
electronic goods, and apparel. See Figure 1 for one sam- 
ple taxonomy. Moreover, all products on Amazon.com bear 
content descriptions relating to these domain taxonomies. 
Featured topics could include author, genre, and audience. 


4.2 Topic Diversification Algorithm 


Algorithm 1 shows the complete topic diversification algorithm;
a brief textual sketch is given in the next paragraphs.

Function $P_{w_i*}$ denotes the new recommendation list,
resulting from applying topic diversification. For every list
entry $z \in [2, N]$, we collect those products $b$ from the candidate
products set $B_i$ that do not occur in positions $o < z$ in $P_{w_i*}$
and compute their similarity with set $\{P_{w_i*}(k) \mid k \in [1, z[\}$,
which contains all new recommendations preceding rank $z$.

Sorting all products $b$ according to $c^*(b)$ in reverse order,
we obtain the dissimilarity rank $P^{\mathrm{rev}}_{c^*}$. This rank is then
merged with the original recommendation rank $P_{w_i}$ according
to diversification factor $\Theta_F$, yielding final rank $P_{w_i*}$.
Factor $\Theta_F$ defines the impact that dissimilarity rank $P^{\mathrm{rev}}_{c^*}$
exerts on the eventual overall output. Large $\Theta_F \in [0.5, 1]$
favors diversification over $a_i$’s original relevance order, while
low $\Theta_F \in [0, 0.5[$ produces recommendation lists closer to
the original rank $P_{w_i}$. For experimental analysis, we used
diversification factors $\Theta_F \in [0, 0.9]$.

Note that ordered input lists $P_{w_i}$ must be considerably
larger than the final top-N list. For our later experiments, we
used top-50 input lists for eventual top-10 recommendations.
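
procedure diversify ($P_{w_i}$, $\Theta_F$) {
  $B_i \leftarrow \Im P_{w_i}$; $P_{w_i*}(1) \leftarrow P_{w_i}(1)$;
  for $z \leftarrow 2$ to $N$ do
    set $B'_i \leftarrow B_i \setminus \{P_{w_i*}(k) \mid k \in [1, z[\}$;
    $\forall b \in B'_i$: compute $c^*(\{b\}, \{P_{w_i*}(k) \mid k \in [1, z[\})$;
    compute $P_{c^*} : \{1, 2, \ldots, |B'_i|\} \to B'_i$ using $c^*$;
    for all $b \in B'_i$ do
      $P^{\mathrm{rev}\,-1}_{c^*}(b) \leftarrow |B'_i| - P^{-1}_{c^*}(b)$;
      $w^*_i(b) \leftarrow P^{-1}_{w_i}(b) \cdot (1 - \Theta_F) + P^{\mathrm{rev}\,-1}_{c^*}(b) \cdot \Theta_F$;
    end do
    $P_{w_i*}(z) \leftarrow \min\{w^*_i(b) \mid b \in B'_i\}$;
  end do
  return $P_{w_i*}$;
}


Algorithm 1: Sequential topic diversification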
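
The procedure can be made concrete with a small Python sketch. It is a reading of Algorithm 1 under stated assumptions, not the authors' implementation: the similarity ranking is taken to list the most similar candidate first, so that reversing it gives the most dissimilar candidate the lowest value; the set similarity is supplied by the caller; and the helper `mean_sim`, which averages a pairwise similarity over the items chosen so far, is a hypothetical stand-in for the taxonomy-driven metric.

```python
from typing import Callable, Dict, List, Sequence

SetSim = Callable[[str, List[str]], float]


def mean_sim(pair_sim: Callable[[str, str], float]) -> SetSim:
    """Hypothetical set similarity: mean pairwise similarity to chosen items."""
    def set_sim(b: str, chosen: List[str]) -> float:
        return sum(pair_sim(b, c) for c in chosen) / len(chosen)
    return set_sim


def diversify(ranked: Sequence[str], theta: float, n: int,
              set_sim: SetSim) -> List[str]:
    """Greedy re-ranking of an ordered candidate list (best item first)."""
    if not ranked:
        return []
    # Original rank positions, 1 = highest predicted relevance.
    orig_rank: Dict[str, int] = {b: pos for pos, b in enumerate(ranked, start=1)}
    result = [ranked[0]]            # the first position is kept unchanged
    candidates = list(ranked[1:])   # remaining candidates, original order
    while len(result) < n and candidates:
        # Position 1 = most similar to the items selected so far.
        by_sim = sorted(candidates, key=lambda b: set_sim(b, result), reverse=True)
        size = len(candidates)
        # Reversed rank: the most dissimilar candidate gets the value 0.
        rev_rank = {b: size - pos for pos, b in enumerate(by_sim, start=1)}
        # Merge both ranks; the candidate with the smallest value wins.
        best = min(candidates,
                   key=lambda b: orig_rank[b] * (1 - theta) + rev_rank[b] * theta)
        result.append(best)
        candidates.remove(best)
    return result
```

With a diversification factor of 0, the function returns the first N items of the input list unchanged. With a large factor, items from topics not yet represented move ahead of further items from the dominant topic. Ties are broken in favor of the original order.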


4.3 Recommendation Dependency 


In order to implement topic diversification, we assume
that recommended products $P_{w_i}(o)$ and $P_{w_i}(p)$, $o, p \in \mathbb{N}$,
along with their content descriptions, effectively do exert an
impact on each other, which is commonly ignored by existing
approaches: usually, only relevance weight ordering
$o < p \Rightarrow w_i(P_{w_i}(o)) \geq w_i(P_{w_i}(p))$ must hold for
recommendation list items, no other dependencies are assumed.

In case of topic diversification, recommendation interde- 
pendence means that an item b’s current dissimilarity rank 
with respect to preceding recommendations plays an impor- 
tant role and may influence the new ranking. 


4.4 Osmotic Pressure Analogy 


The effect of dissimilarity bears traits similar to that of
osmotic pressure and selective permeability known from
molecular biology [31]. Steady insertion of products $b_o$, taken from
one specific area of interest $d_o$, into the recommendation
list equates to the passing of molecules from one specific
substance through the cell membrane into cytoplasm. With
increasing concentration of $d_o$, owing to the membrane’s
selective permeability, the pressure for molecules $b$ from other
substances $d$ rises. When pressure gets sufficiently high for
one given topic $d_p$, its best products $b_p$ may “diffuse” into
the recommendation list, even though their original rank
$P^{-1}_{w_i}(b)$ might be inferior to candidates from the prevailing
domain $d_o$. Consequently, pressure for $d_p$ decreases, paving
the way for another domain for which pressure peaks.

Topic diversification hence resembles the membrane’s se- 
lective permeability, which allows cells to maintain their in- 
ternal composition of substances at required levels. 


5. EMPIRICAL ANALYSIS 


We conducted offline evaluations to understand the
ramifications of topic diversification on accuracy metrics, and
online analysis to investigate how our method affects actual
user satisfaction. We applied topic diversification with
$\Theta_F \in \{0, 0.1, 0.2, \ldots, 0.9\}$ to lists generated by both
user-based CF and item-based CF, observing effects that occur
when steadily increasing $\Theta_F$ and analyzing how both
approaches respond to diversification.


Figure 1: Fragment from the Amazon.com book taxonomy


5.1 Dataset Design 


We based online and offline analyses on data we gathered
from BookCrossing (http://www.bookcrossing.com). This
community caters to book lovers exchanging books all
around the world and sharing their experiences with others.


5.1.1 Data Collection 


In a 4-week crawl, we collected data on 278,858 members 
of BookCrossing and 1,157,112 ratings, both implicit and 
explicit, referring to 271,379 distinct ISBNs. Invalid ISBNs 
were excluded from the outset. 

The complete BookCrossing dataset, featuring fully
anonymized information, is available via the first author’s
homepage (http://www.informatik.uni-freiburg.de/~cziegler).

Next, we mined Amazon.com’s book taxonomy, compris- 
ing 13,525 distinct topics. In order to be able to apply topic 
diversification, we mined content information, focusing on 
taxonomic descriptions that relate books to taxonomy nodes 
from Amazon.com. Since many books on BookCrossing refer 
to rare, non-English books, or outdated titles not in print 
anymore, we were able to garner background knowledge for 
only 175,721 books. In total, 466,573 topic descriptors were
found, giving an average of 2.66 topics per book.


5.1.2 Condensation Steps 


Owing to the BookCrossing dataset’s extreme sparsity, we 
decided to further condense the set in order to obtain more 
meaningful results from CF algorithms when computing rec- 
ommendations. Hence, we discarded all books missing taxo- 
nomic descriptions, along with all ratings referring to them. 
Next, we also removed book titles with fewer than 20 overall 
mentions. Only community members with at least 5 ratings 
each were kept. 

The resulting dataset’s dimensions were considerably more 
moderate, featuring 10,339 users, 6,708 books, and 361,349
book ratings.
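
The condensation steps can be sketched as three successive filters over a list of (user, book) rating pairs. This is an illustrative reading only: the function name is hypothetical, and the sketch assumes that each filter is applied once, in the order described, without re-checking earlier thresholds afterwards.

```python
from collections import Counter
from typing import List, Set, Tuple

Rating = Tuple[str, str]  # (user id, book id)


def condense(ratings: List[Rating], has_taxonomy: Set[str],
             min_book: int = 20, min_user: int = 5) -> List[Rating]:
    """Apply the three condensation filters in sequence."""
    # 1. Discard ratings for books that lack taxonomic descriptions.
    kept = [(u, b) for u, b in ratings if b in has_taxonomy]
    # 2. Remove books with fewer than `min_book` overall mentions.
    book_counts = Counter(b for _, b in kept)
    kept = [(u, b) for u, b in kept if book_counts[b] >= min_book]
    # 3. Keep only users with at least `min_user` remaining ratings.
    user_counts = Counter(u for u, _ in kept)
    return [(u, b) for u, b in kept if user_counts[u] >= min_user]
```

The default thresholds mirror the values given in the text; smaller values are convenient for trying the function on toy data.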


5.2 Offline Experiments 


We performed offline experiments comparing precision,
recall, and intra-list similarity scores for 20 different
recommendation list setups. Half of these recommendation lists were
based upon user-based CF with different degrees of diver- 
sification, the others on item-based CF. Note that we did 
not compute MAE metric values since we are dealing with 
implicit rather than explicit ratings. 


5.2.1 Evaluation Framework Setup 


For cross-validation of precision and recall metrics of all
10,339 users, we adopted K-folding with parameter K = 4.
Hence, rating profiles $R_i$ were effectively split into training
sets $R_i^x$ and test sets $T_i^x$, $x \in \{1, \ldots, 4\}$, at a ratio of 3:1.
For each of the 41,356 different training sets, we computed
20 top-10 recommendation lists.

To generate the diversified lists, we computed top-50 lists 
based upon pure, i.e., non-diversified, item-based CF and 
pure user-based CF. The high-performance SUGGEST
recommender engine² was used to compute these base case lists.
Next, we applied the diversification algorithm to both base
cases, applying $\Theta_F$ factors ranging from 10% up to 90%. For
evaluation, all lists were truncated to contain 10 books only.


5.2.2 Result Analysis 


We were interested in seeing how accuracy, captured by
precision and recall, behaves when increasing $\Theta_F$ from 0.1 up
to 0.9. Since topic diversification may make books with high
predicted accuracy trickle down the list, we hypothesized
that accuracy will deteriorate for $\Theta_F \to 0.9$. Moreover, in
order to find out if our novel algorithm has any significant,
positive effects on the diversity of items featured, we also
applied our intra-list similarity metric. An overlap analysis
for diversified lists, $\Theta_F \geq 0.1$, versus their respective
non-diversified counterparts indicates how many items stayed the
same for increasing diversification factors.


5.2.2.1 Precision and Recall. 

First, we analyzed precision and recall scores for both
non-diversified base cases, i.e., when $\Theta_F = 0$. Table 1 states that
user-based and item-based CF exhibit almost identical
accuracy, indicated by precision values. Their recall values differ
considerably, hinting at deviating behavior with respect to
the types of users they are scoring for.


Figure 2: Precision (a) and recall (b) for increasing $\Theta_F$


Precision   3.64   3.69
Recall      7.32   5.76


Table 1: Precision/recall for non-diversified CF

Next, we analyzed the behavior of user-based and item-based
CF when steadily increasing $\Theta_F$ by increments of 10%,
depicted in Figure 2. The two charts reveal that diversification
has detrimental effects on both metrics and on both CF
algorithms. Interestingly, corresponding precision and recall
curves have almost identical shape.

The loss in accuracy is more pronounced for item-based
than for user-based CF. Furthermore, for either metric and
either CF algorithm, the drop is most distinctive for
$\Theta_F \in [0.2, 0.4]$. For lower $\Theta_F$, negative impacts on accuracy are
marginal. We attribute this last observation to the fact
that precision and recall are permutation-insensitive, i.e.,
the mere order of recommendations within a top-N list does
not influence the metric value, as opposed to Breese score [3,
12]. For low $\Theta_F$, the pressure that the dissimilarity
rank exerts on the top-N list’s makeup is still too weak to
make many new items diffuse into the top-N list. Hence, we
conjecture that it is rather the positions of current top-N items
that change, which does not affect either precision or recall.


5.2.2.2 Intra-List Similarity. 

Knowing that our diversification method exerts a significant,
negative impact on accuracy metrics, we wanted to
know how our approach affected the intra-list similarity
measure. Similar to the precision and recall experiments, we
computed metric values for user-based and item-based CF
with $\Theta_F \in [0, 0.9]$ each. Hereby, we instantiated the
intra-list similarity metric function $c_o$ with our taxonomy-driven
metric $c^*$. Results obtained from intra-list similarity analysis
are given in Figure 3(a).

The topic diversification method considerably lowers the
pairwise similarity between list items, thus making top-N
recommendation lists more diverse. Diversification appears
to affect item-based CF more strongly than its user-based
counterpart, in line with our findings about precision and recall.
For lower $\Theta_F$, curves are less steep than for $\Theta_F \in [0.2, 0.4]$,
which also aligns well with precision and recall analysis.
Again, the latter phenomenon can be explained by one of
the metric’s inherent features, i.e., like precision and recall,
intra-list similarity is permutation-insensitive.


5.2.2.3 Original List Overlap. 

Figure 3(b) shows the number of recommended items staying
the same when increasing $\Theta_F$ with respect to the original
list’s content. Both curves exhibit roughly linear shapes,
though they are less steep for low $\Theta_F$. Interestingly, for factors
$\Theta_F < 0.4$, at most 3 recommendations change on average.


5.2.2.4 Conclusion. 

We found that diversification appears detrimental to both 
user-based and item-based CF along precision and recall 
metrics. In fact, this outcome aligns with our expectations, 
considering the nature of those two accuracy metrics and 
the way that the topic diversification method works. More- 
over, we found that item-based CF seems more susceptible 
to topic diversification than user-based CF, backed by re- 
sults from precision, recall and intra-list similarity metric 
analysis. 


5.3 Online Experiments 


Offline experiments helped us in understanding the impli- 
cations of topic diversification on both CF algorithms. We 
could also observe that the effects of our approach are dif- 
ferent on different algorithms. However, knowing about the 
deficiencies of accuracy metrics, we wanted to assess actual 
user satisfaction for various degrees of diversification, thus 
necessitating an online survey. 

For the online study, we computed each recommendation
list type anew for users in the denser BookCrossing dataset,
though without K-folding. In cooperation with BookCrossing,
we mailed all eligible users via the community mailing
system, asking them to participate in our online study. Each
mail contained a personal link that would direct the user to
our online survey pages. In order to make sure that only
the users themselves would complete their survey, links
contained unique, encrypted access codes.


Figure 3: Intra-list similarity behavior (a) and overlap with original list (b) for increasing $\Theta_F$

During the 3-week survey phase, 2,125 users participated 
and completed the study. 


5.3.1 Survey Outline and Setup 


The survey consisted of several screens that would tell 
the prospective participant about this study’s nature and 
his task, show all his ratings used for making recommen- 
dations, and finally present a top-10 recommendation list, 
asking several questions thereafter. 

For each book, users could state their interest on a 5-point 
rating scale. Scales ranged from “not much” to “very much”, 
mapped to values 1 to 4, and allowed the user to indicate that
he had already read the book, mapped to value 5. In order to 
successfully complete the study, users were not required to 
rate all their top-10 recommendations. Neutral values were 
assumed for non-votes instead. However, we required users 
to answer all further questions, concerning the list as a whole 
rather than its single recommendations, before submitting 
their results. We embedded those questions we were actually 
keen about knowing into ones of lesser importance, in order 
to conceal our intentions and not bias users. 

The single top-10 recommendation list for each user was chosen
among 12 candidate lists, generated by either user-based or
item-based CF, each with Θ_F ∈ {0, 0.3, 0.4, 0.5, 0.7, 0.9}.
We opted for those 12 instead of all 20 list types in order to
acquire enough users completing the survey for each slot. The
assignment of a specific list to the current user was done
dynamically, at the time the participant entered the survey,
and in a round-robin fashion. Thus, we could guarantee that the
number of users per list type was roughly identical.
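
The round-robin assignment described above can be illustrated with a minimal Python sketch. The names LIST_TYPES, make_assigner, and the participant ids are hypothetical and serve illustration only; the sketch merely assumes that each arriving participant receives the next of the 12 list types in a fixed cyclic order.

```python
from itertools import cycle, product

# The 12 candidate list types: two CF algorithms times six diversification factors.
ALGORITHMS = ("user-based", "item-based")
THETA_F = (0.0, 0.3, 0.4, 0.5, 0.7, 0.9)
LIST_TYPES = list(product(ALGORITHMS, THETA_F))


def make_assigner(list_types):
    """Return a function that hands out list types in round-robin order."""
    slots = cycle(list_types)

    def assign(user_id):
        # user_id is ignored on purpose: the slot depends only on arrival order.
        return next(slots)

    return assign


assign = make_assigner(LIST_TYPES)
arrivals = [f"user{i}" for i in range(2125)]  # hypothetical participant ids
counts = {}
for user in arrivals:
    slot = assign(user)
    counts[slot] = counts.get(slot, 0) + 1

# Group sizes differ by at most one when every arriving user completes the survey.
print(min(counts.values()), max(counts.values()))
```

Because assignment takes place on entry, participants who abandon the survey still consume a slot, which is why group sizes in practice are only roughly, not exactly, identical.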


5.3.2 Result Analysis 


For the analysis of our inter-subject survey, we were mostly
interested in the following three aspects. First, the average
rating users gave to their 10 single recommendations. We
expected results to roughly align with scores obtained from
precision and recall, owing to the very nature of these
metrics. Second, we wanted to know if users perceived their
list as well-diversified, asking them to tell whether the
lists reflected rather a broad or narrow range of their reading
interests. Referring to the intra-list similarity metric, we
expected users’ perceived range of topics, i.e., the list’s
diversity, to increase with increasing Θ_F. Third, we were
curious about the overall satisfaction of users with their
recommendation lists in their entirety, the measure to compare
performance.

The latter two questions were answered by each user on a
5-point Likert scale, higher scores denoting better
performance, and we averaged the results over the number of
users. Statistical significance of all mean values was
measured by parametric one-factor ANOVA, where p <
0.05 if not indicated otherwise.
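
For readers unfamiliar with one-factor ANOVA, the following Python sketch computes the F statistic that underlies such significance tests. The function name one_way_anova and the toy scores are assumptions made for illustration and are not survey data; obtaining the p-value additionally requires the F distribution, as offered by statistics libraries such as SciPy (scipy.stats.f_oneway).

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of single scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy Likert scores for three hypothetical list types (not survey data).
scores = [
    [3, 4, 3, 5, 4],
    [4, 5, 4, 5, 5],
    [2, 3, 3, 2, 3],
]
f_stat, df_b, df_w = one_way_anova(scores)
print(round(f_stat, 2), df_b, df_w)
```

A large F value means that the group means differ by more than the spread within each group would suggest.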


5.3.2.1  Single-Vote Averages. 

Users perceived recommendations made by user-based CF 
systems on average as more accurate than those made by 
item-based CF systems, as depicted in Figure 4(a). At each
featured diversification level Θ_F, differences between the two
CF types are statistically significant, p ≪ 0.01.

Moreover, for each algorithm, higher diversification factors
entail lower single-vote average scores, which confirms our
hypothesis stated before. The item-based CF’s cusp at
Θ_F ∈ [0.3, 0.5] appears as a notable outlier, opposed to
the trend, but differences between the 3 means at
Θ_F ∈ [0.3, 0.5] are not statistically significant, p > 0.15.
By contrast, differences between all factors Θ_F are significant
for item-based CF, p < 0.01, and for user-based CF, p < 0.1.

Hence, topic diversification negatively correlates with pure
accuracy. Besides, users perceived the performance of
user-based CF as significantly better than item-based CF for
all corresponding levels Θ_F.


5.3.2.2 Covered Range. 

Next, we analyzed whether users actually perceived the
variety-augmenting effects caused by topic diversification,
illustrated before through the measurement of intra-list
similarity. Users’ reactions to steadily incrementing Θ_F are
illustrated in Figure 4(b). First, between both algorithms on
corresponding Θ_F levels, only the difference of means at
Θ_F = 0.3 shows statistical significance.


Figure 4: Results for single-vote averages (a), covered range of interests (b), and overall satisfaction (c)

Studying the trend of user-based CF for increasing Θ_F, we
notice that the perceived range of reading interests covered
by users’ recommendation lists also increases. Hereby, the
curve’s first derivative maintains an approximately constant
level, exhibiting slight peaks for Θ_F ∈ [0.4, 0.5].
Statistical significance holds for user-based CF between means
at Θ_F = 0 and Θ_F > 0.5, and between Θ_F = 0.3 and Θ_F = 0.9.

On the contrary, the item-based curve exhibits a drastically
different behavior. While soaring at Θ_F = 0.3 to 3.186,
reaching a score almost identical to the user-based CF’s peak
at Θ_F = 0.9, the curve barely rises for Θ_F ∈ [0.4, 0.9],
remaining rather stable and showing a slight, though
insignificant, upward trend. Statistical significance was shown
for Θ_F = 0 with respect to all other samples taken from
Θ_F ∈ [0.3, 0.9]. Hence, our online results do not perfectly
align with findings obtained from offline analysis. While the
intra-list similarity chart in Figure 3 indicates that diversity
increases when increasing Θ_F, the item-based CF chart defies
this trend, first soaring then flattening. We conjecture
that the following three factors account for these
peculiarities:


• Diversification factor impact. Our offline analysis
of the intra-list similarity already suggested that the
effect of topic diversification on item-based CF is much
stronger than on user-based CF. Thus, the item-based
CF’s user-perceived interest coverage is significantly
higher at Θ_F = 0.3 than the user-based CF’s.


• Human perception. We believe that human perception
can capture the level of diversification inherent
to a list only to some extent. Beyond that point,
increasing diversity remains unnoticed. For the
application scenario at hand, Figure 4 suggests this point
around score value 3.2, reached by user-based CF only
at Θ_F = 0.9, and approximated by item-based CF
already at Θ_F = 0.3.


• Interaction with accuracy. Analyzing results obtained,
bear in mind that covered range scores are not
fully independent from single-vote averages. When
accuracy is poor, i.e., the user feels unable to identify
recommendations that are interesting to him, chances
are high his discontentment will also negatively affect
his diversity rating. For Θ_F ∈ [0.5, 0.9], single-vote
averages are remarkably low, which might explain why
perceived coverage scores do not improve for increasing
Θ_F.


However, we may conclude that users do perceive the
application of topic diversification as an overall positive
effect on reading interest coverage.


5.3.2.3 Overall List Value. 

The third feature variable we were evaluating, the overall
value users assigned to their personal recommendation list,
effectively represents the target value of our studies,
measuring actual user satisfaction. Owing to our conjecture that
user satisfaction is a mere composite of accuracy and other
influential factors, such as the list’s diversity, we
hypothesized that the application of topic diversification would
increase satisfaction. At the same time, considering the
downward trend of precision and recall for increasing Θ_F, in
accordance with declining single-vote averages, we expected
user satisfaction to drop off for large Θ_F. Hence, we
supposed an arc-shaped curve for both algorithms.

Results for overall list value are given in Figure 4(c).
Analyzing user-based CF, we observe that the curve does not
follow our hypothesis. Slightly improving at Θ_F = 0.3 over the
non-diversified case, scores drop for Θ_F ∈ [0.4, 0.7],
eventually culminating in a slight but visible upturn at
Θ_F = 0.9. Although we lack a reasonable explanation for this
shape, which runs counter to our hypothesis, the differences
between the curve’s data points are not statistically
significant at p < 0.1. Hence, we conclude that
topic diversification has a marginal, largely negligible impact
on overall user satisfaction, initial positive effects eventually
being offset by declining accuracy.

For item-based CF, by contrast, the results look different.
In compliance with our previous hypothesis, the curve’s shape
roughly follows an arc, peaking at Θ_F = 0.4. Taking the three
data-points defining the arc, we obtain statistical
significance for p < 0.1. Since the endpoint’s score at
Θ_F = 0.9 is inferior to the non-diversified case’s, we observe
that too much diversification appears detrimental, perhaps
owing to substantial interactions with accuracy.

In summary, for overall list value analysis, we conclude
that topic diversification has no measurable effects
on user-based CF, but significantly improves item-based CF
performance for diversification factors Θ_F around 0.4.


5.4 Multiple Linear Regression 


Results obtained from analyzing user feedback along var- 
ious feature axes already indicated that users’ overall satis- 
faction with recommendation lists not only depends on ac- 
curacy, but also on the range of reading interests covered. 
In order to assess that indication more rigorously by means of
statistical methods, we applied multiple linear regression to
our survey results, choosing the overall list value as the
dependent variable. As independent input variables, we provided
single-vote averages and covered range, both appearing as
first-order and second-order polynomials, i.e., SVA and CR,
and SVA² and CR², respectively. We also tried several other,
more complex models, without achieving significantly better
model fitting.
           Estimate   Std. Error   t-Value   Pr(>|t|)
(const)        3.27        0.023    139.56   < 2e-16
SVA           12.42        0.973     12.78   < 2e-16
SVA²          -6.11        0.976     -6.26   4.76e-10
CR            19.19        0.982     19.54   < 2e-16
CR²           -3.27        0.966     -3.39   0.000727


Multiple R²: 0.305, adjusted R²: 0.303


Table 2: Multiple linear regression results 


Analyzing multiple linear regression results, shown in
Table 2, the p-values Pr(>|t|) clearly indicate that
statistically significant correlations for accuracy and covered
range with user satisfaction exist. Since statistical
significance also holds for their respective second-order
polynomials, i.e., CR² and SVA², we conclude that these
relationships are non-linear and more complex, though.

As a matter of fact, linear regression delivers a strong in- 
dication that the intrinsic utility of a list of recommended 
items is more than just the average value of accuracy votes 
for all single items, but also depends on the perceived diver- 
sity. 
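
To make the model structure concrete, the following Python sketch fits overall list value against SVA, SVA², CR, and CR² by ordinary least squares. It is an illustration under assumptions: the function names and the noise-free toy data are hypothetical, and raw powers are used for the second-order terms, so the resulting coefficients are not comparable to the estimates in Table 2.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_quadratic_model(sva, cr, value):
    """Least-squares fit of value ~ 1 + SVA + SVA^2 + CR + CR^2."""
    rows = [[1.0, s, s * s, c, c * c] for s, c in zip(sva, cr)]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, value)) for i in range(k)]
    return solve(xtx, xty)


# Noise-free toy data on a small grid (hypothetical, not survey data).
true_coef = [0.5, 1.2, -0.15, 0.9, -0.1]
sva = [float(s) for s in range(1, 5) for _ in range(1, 6)]
cr = [float(c) for _ in range(1, 5) for c in range(1, 6)]
value = [
    true_coef[0] + true_coef[1] * s + true_coef[2] * s * s
    + true_coef[3] * c + true_coef[4] * c * c
    for s, c in zip(sva, cr)
]
coef = fit_quadratic_model(sva, cr, value)
print([round(c, 3) for c in coef])
```

Negative coefficients on the squared terms, as in Table 2, correspond to diminishing returns: satisfaction rises with accuracy and with covered range, but ever more slowly.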


5.5 Limitations 


There are some limitations to the study, notably referring 
to the way topic diversification was implemented. Though 
the Amazon.com taxonomies were human-created, there may 
still be some mismatch between what the topic diversifica- 
tion algorithm perceives as “diversified” and what humans 
do. The issue is effectively inherent to the taxonomy’s struc- 
ture, which has been designed with browsing tasks and ease 
of searching rather than with interest profile generation in 
mind. For instance, the taxonomy features topic nodes la- 
belled with letters for alphabetical ordering of authors from 
the same genre, e.g., BOOKS → FICTION → ... → AUTHORS,
A-Z → G. Hence, two Sci-Fi books from two different
authors with the same initial of their last name would be
classified under the same node, while another Sci-Fi book from an
author with a different last-name initial would not. Though 
the problem’s impact is largely marginal, owing to the rel- 
atively deep level of nesting where such branchings occur, 
the procedure appears far from intuitive. 

An alternative approach to further investigate the accu- 
racy of taxonomy-driven similarity measurement, and its 
limitations, would be to have humans do the clustering, e.g., 
by doing card sorts or by estimating the similarity of any 
two books contained in the book database. The results could 
then be matched against the topic diversification method’s 
output. 


6. RELATED WORK 


Few efforts have addressed the problem of making top-N 
lists more diverse. Only considering literature on collabo- 
rative filtering and recommender systems in general, none 
have been presented before, to the best of our knowledge. 

However, some work related to our topic diversification 
approach can be found in information retrieval, specifically
meta-search engines. A critical aspect of meta-search engine
design is the merging of several top-N lists into one single 
top-N list. Intuitively, this merged top-N list should reflect 
the highest quality ranking possible, also known as the “rank 
aggregation problem” [6]. Most approaches use variations of 
the “linear combination of score” model (LC), described by 
Vogt and Cottrell [32]. The LC model effectively resembles 
our scheme for merging the original, accuracy-based rank- 
ing with the current dissimilarity ranking, but is more gen- 
eral and does not address the diversity issue. Fagin et al. [7] 
propose metrics for measuring the distance between top-N 
lists, i.e., inter-list similarity metrics, in order to evaluate 
the quality of merged ranks. Oztekin et al. [21] extend the 
linear combination approach by proposing rank combination 
models that also incorporate content-based features in order 
to identify the most relevant topics. 
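
The resemblance between linear score combination and our rank merging can be made tangible with a small Python sketch. It assumes that an item's merged position is the weighted sum of its positions in the accuracy-based ranking and in the dissimilarity ranking, with weight Θ_F on the latter; the function name merge_ranks, the tie-breaking rule, and the toy items are hypothetical.

```python
def merge_ranks(accuracy_order, dissimilarity_order, theta_f):
    """Order items by a weighted sum of their positions in two rankings.

    theta_f = 0 reproduces accuracy_order; theta_f = 1 reproduces
    dissimilarity_order. Ties are broken in favour of the accuracy ranking.
    """
    acc_pos = {item: i for i, item in enumerate(accuracy_order, start=1)}
    dis_pos = {item: i for i, item in enumerate(dissimilarity_order, start=1)}
    if acc_pos.keys() != dis_pos.keys():
        raise ValueError("both rankings must contain the same items")

    def combined(item):
        return (1 - theta_f) * acc_pos[item] + theta_f * dis_pos[item]

    return sorted(accuracy_order, key=lambda it: (combined(it), acc_pos[it]))


accuracy_order = ["a", "b", "c", "d"]       # best predicted item first
dissimilarity_order = ["d", "c", "a", "b"]  # most dissimilar item first
for theta_f in (0.0, 0.5, 1.0):
    print(theta_f, merge_ranks(accuracy_order, dissimilarity_order, theta_f))
```

In a diversification setting, the dissimilarity ranking would be recomputed after each item is placed, since it depends on the items already in the list; the sketch shows a single merge step only.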

More related to our idea of creating lists that represent the 
whole plethora of the user’s topic interests, Kummamuru et 
al. [15] present their clustering scheme that groups search 
results into clusters of related topics. The user can then 
conveniently browse topic folders relevant to his search in- 
terest. The commercially available search engine NORTHERN 
LIGHT (http://www.northernlight.com) incorporates similar 
functionalities. Google (http://www.google.com) uses several 
mechanisms to suppress top-N items too similar in content, 
showing them only upon the user’s explicit request. Unfor- 
tunately, no publications on that matter are available. 


7. CONCLUSION 


We presented topic diversification, an algorithmic frame- 
work to increase the diversity of a top-N list of recommended 
products. In order to show its efficiency in diversifying, we 
also introduced our new intra-list similarity metric. 

Contrasting precision and recall metrics, computed both 
for user-based and item-based CF and featuring different 
levels of diversification, with results obtained from a large- 
scale user survey, we showed that the user’s overall liking 
of recommendation lists goes beyond accuracy and involves 
other factors, e.g., the users’ perceived list diversity. We were 
thus able to provide empirical evidence that lists are more 
than mere aggregations of single recommendations, but bear 
an intrinsic, added value. 

Though effects of diversification were largely marginal on 
user-based CF, item-based CF performance improved signif- 
icantly, an indication that there are some behavioral differ- 
ences between both CF classes. Moreover, while pure item- 
based CF appeared slightly inferior to pure user-based CF in 
overall satisfaction, diversifying item-based CF with factors 
Θ_F ∈ [0.3, 0.4] made item-based CF outperform user-based
CF. Interestingly, for Θ_F < 0.4, no more than three items
tend to change with respect to the original list, shown in 
Figure 3. Small changes thus have high impact. 

We believe our findings are especially valuable for practical
application scenarios, since many commercial recommender 
systems, e.g., Amazon.com [16] and TiVo [1], are item-based, 
owing to the algorithm’s computational efficiency. 


8. FUTURE WORK 


Possible future directions branching out from our current 
state of research on topic diversification are rife. 

First, we would like to study the impact of topic diversi- 
fication when dealing with application domains other than 
books, e.g., movies, CDs, and so forth. Results obtained may
differ, owing to distinct characteristics concerning the struc- 
ture of genre classification inherent to these domains. For 
instance, Amazon.com’s classification taxonomy for books 
is more deeply nested, though smaller, than its movie coun- 
terpart [34]. Bear in mind that the structure of these tax- 
onomies severely affects the taxonomy-based similarity mea- 
sure c*, which lies at the very heart of the topic diversifica- 
tion method. 

Another interesting path to follow would be to param- 
eterize the diversification framework with several different 
similarity metrics, either content-based or CF-based, hence 
superseding the taxonomy-based c*. 
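
Such a parameterization could look as follows in Python. The sketch assumes that intra-list similarity is the sum of pairwise similarities over all unordered item pairs of a list and accepts the similarity function as an argument; the Jaccard measure over topic sets and the toy lists are hypothetical stand-ins for a content-based metric, not the taxonomy-based c*.

```python
from itertools import combinations


def jaccard(topics_a, topics_b):
    """Hypothetical content-based similarity: overlap of two topic sets."""
    union = topics_a | topics_b
    return len(topics_a & topics_b) / len(union) if union else 0.0


def intra_list_similarity(items, similarity):
    """Sum the pairwise similarity over all unordered pairs in a list."""
    return sum(similarity(a, b) for a, b in combinations(items, 2))


# Toy lists, each item described by a set of topic labels.
narrow = [{"sci-fi"}, {"sci-fi", "space"}, {"sci-fi"}]
broad = [{"sci-fi"}, {"cooking"}, {"history"}]
print(intra_list_similarity(narrow, jaccard))
print(intra_list_similarity(broad, jaccard))
```

Any other symmetric similarity function, for instance one derived from rating correlations, could be passed in place of jaccard without changing the metric itself.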

We strongly believe that our topic diversification approach 
bears particularly high relevance for recommender systems 
involving sequential consumption of list items. For instance, 
think of personalized Internet radio stations, e.g., Yahoo’s 
Launch (http://launch.yahoo.com): community members are
provided with playlists, computed according to their own 
taste, which are sequentially processed and consumed. Con- 
trolling the right mix of items within these lists becomes vi- 
tal and even more important than for mere “random-access” 
recommendation lists, e.g., book or movie lists. Suppose such 
an Internet radio station playing five Sisters of Mercy songs 
in a row. Though the active user may actually like the re- 
spective band, he may not want all five songs played in se- 
quence. Lack of diversion might thus result in the user leav- 
ing the system. 

The problem of finding the right mix for sequential con- 
sumption-based recommenders takes us to another future di- 
rection worth exploring, namely individually adjusting the 
right level of diversification versus accuracy tradeoff. One 
approach could be to have the user himself define the de- 
gree of diversification he likes. Another approach might in- 
volve learning the right parameter from the user’s behavior, 
e.g., by observing which recommended items he inspects and 
devotes more time to, etc. 

Finally, we are also thinking about diversity metrics other 
than intra-list similarity. For instance, we envision a metric 
that measures the extent to which the top-N list actually 
reflects the user’s profile. 